unknown state
Reliability Quantification of Deep Reinforcement Learning-based Control
Yoshioka, Hitoshi, Hashimoto, Hirotada
Reliability quantification of deep reinforcement learning (DRL)-based control is a significant challenge for the practical application of artificial intelligence (AI) in safety-critical systems. This study proposes a method for quantifying the reliability of DRL-based control. First, an existing method, random noise distillation, was applied to the reliability evaluation to clarify the issues to be solved. Second, a novel method for reliability quantification was proposed to solve these issues. The reliability is quantified using two neural networks: reference and evaluator. They have the same structure with the same initial parameters. The outputs of the two networks were the same before training. During training, the evaluator network parameters were updated to maximize the difference between the reference and evaluator networks for trained data. Thus, the reliability of the DRL-based control for a state can be evaluated based on the difference in output between the two networks. The proposed method was applied to DQN-based control as an example of a simple task, and its effectiveness was demonstrated. Finally, the proposed method was applied to the problem of switching trained models depending on the state. Con-sequently, the performance of the DRL-based control was improved by switching the trained models according to their reliability.
Sample-optimal classical shadows for pure states
Grier, Daniel, Pashayan, Hakop, Schaeffer, Luke
We consider the classical shadows task for pure states in the setting of both joint and independent measurements. The task is to measure few copies of an unknown pure state $\rho$ in order to learn a classical description which suffices to later estimate expectation values of observables. Specifically, the goal is to approximate $\mathrm{Tr}(O \rho)$ for any Hermitian observable $O$ to within additive error $\epsilon$ provided $\mathrm{Tr}(O^2)\leq B$ and $\lVert O \rVert = 1$. Our main result applies to the joint measurement setting, where we show $\tilde{\Theta}(\sqrt{B}\epsilon^{-1} + \epsilon^{-2})$ samples of $\rho$ are necessary and sufficient to succeed with high probability. The upper bound is a quadratic improvement on the previous best sample complexity known for this problem. For the lower bound, we see that the bottleneck is not how fast we can learn the state but rather how much any classical description of $\rho$ can be compressed for observable estimation. In the independent measurement setting, we show that $\mathcal O(\sqrt{Bd} \epsilon^{-1} + \epsilon^{-2})$ samples suffice. Notably, this implies that the random Clifford measurements algorithm of Huang, Kueng, and Preskill, which is sample-optimal for mixed states, is not optimal for pure states. Interestingly, our result also uses the same random Clifford measurements but employs a different estimator.
Explicit Explore, Exploit, or Escape ($E^4$): near-optimal safety-constrained reinforcement learning in polynomial time
Bossens, David M., Bishop, Nicholas
In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^{4}$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, as well as safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that $E^4$ finds a near-optimal constraint-satisfying policy in polynomial time whilst satisfying safety constraints throughout the learning process. We discuss robust-constrained offline optimisation algorithms as well as how to incorporate uncertainty in transition dynamics of unknown states based on empirical inference and prior knowledge.
Stochastic Shortest Path with Adversarially Changing Costs
Rosenberg, Aviv, Mansour, Yishay
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In this paper we consider adversarial SSPs that also account for adversarial changes in the costs over time, while the dynamics (i.e., transition function) remains unchanged. Formally, an agent interacts with an SSP environment for $K$ episodes, the cost function changes arbitrarily between episodes, and the fixed dynamics are unknown to the agent. We give high probability regret bounds of $\widetilde O (\sqrt{K})$ assuming all costs are strictly positive, and $\widetilde O (K^{3/4})$ for the general case. To the best of our knowledge, we are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.
Exploring Unknown States with Action Balance
Song, Yan, Chen, Yingfeng, Hu, Yujing, Fan, Changjie
Exploration is a key problem in reinforcement learning. Recently bonus-based methods have achieved considerable successes in environments where exploration is difficult such as Montezuma's Revenge, which assign additional bonus (e.g., intrinsic reward) to guide the agent to rarely visited states. Since the bonus is calculated according to the novelty of the next state after performing an action, we call such methods the next-state bonus methods. However, the next-state bonus methods bring extra issues. It may lead agent to be trapped in states that fewer being visited and ignore to explore unknown states. Moreover, the behavior policy of the agent is also influenced by the bonus added to the state (or state-action) values indirectly. In contrast to the bonus-based methods which explore in known states, in this paper, we focus on the other part of exploration: exploration for finding unknown states. We propose the action balance exploration method to overcome the defects of the next-state bonus methods, which balances the chosen time of each action in each state and can be treated as an extension of upper confidence bound (UCB) to deep reinforcement learning. To take both the advantages of the next-state bonus method and our action balance exploration method, we propose the action balance RND method, which takes both parts of exploration into consideration. The experiments on grid world and Atari games demonstrate action balance exploration has a better capability in finding unknown states and can improve the real performance of RND in some hard exploration environments respectively.
Deep, spatially coherent Inverse Sensor Models with Uncertainty Incorporation using the evidential Framework
Bauer, Daniel, Kuhnert, Lars, Eckstein, Lutz
To perform high speed tasks, sensors of autonomous cars have to provide as much information in as few time steps as possible. However, radars, one of the sensor modalities autonomous cars heavily rely on, often only provide sparse, noisy detections. These have to be accumulated over time to reach a high enough confidence about the static parts of the environment. For radars, the state is typically estimated by accumulating inverse detection models (IDMs). We employ the recently proposed evidential convolutional neural networks which, in contrast to IDMs, compute dense, spatially coherent inference of the environment state. Moreover, these networks are able to incorporate sensor noise in a principled way which we further extend to also incorporate model uncertainty. We present experimental results that show This makes it possible to obtain a denser environment perception in fewer time steps.
Learning without recall in directed circles and rooted trees
Rahimian, M. Amin, Jadbabaie, Ali
This work investigates the case of a network of agents that attempt to learn some unknown state of the world amongst the finitely many possibilities. At each time step, agents all receive random, independently distributed private signals whose distributions are dependent on the unknown state of the world. However, it may be the case that some or any of the agents cannot distinguish between two or more of the possible states based only on their private observations, as when several states result in the same distribution of the private signals. In our model, the agents form some initial belief (probability distribution) about the unknown state and then refine their beliefs in accordance with their private observations, as well as the beliefs of their neighbors. An agent learns the unknown state when her belief converges to a point mass that is concentrated at the true state. A rational agent would use the Bayes' rule to incorporate her neighbors' beliefs and own private signals over time. While such repeated applications of the Bayes' rule in networks can become computationally intractable, in this paper, we show that in the canonical cases of directed star, circle or path networks and their combinations, one can derive a class of memoryless update rules that replicate that of a single Bayesian agent but replace the self beliefs with the beliefs of the neighbors. This way, one can realize an exponentially fast rate of learning similar to the case of Bayesian (fully rational) agents. The proposed rules are a special case of the Learning without Recall.
Switching to Learn
Shahrampour, Shahin, Rahimian, Mohammad Amin, Jadbabaie, Ali
Distributed estimation, detection, and learning theory in networks have attracted much attention over the past decades [1], [2], [3], [4], with applications that range from sensor and robotic networks [5], [6], [7], [8], [9] to social and economic networks [10], [11], [12]. In these scenarios, agents in a network need to learn the value of a parameter that they may not be able to infer on their own, but the global spread of information in the network provides them with adequate data to learn the truth collectively. As a result, agents iteratively exchange information with their neighbors. For instance, in distributed sensor and robotic networks, agents use local diffusion to augment their imperfect observations with information from their neighbors and achieve consensus and coordination [13], [14]. Similarly, agents exchange beliefs in social networks to benefit from each other's observations and private information and learn the unknown state of the world [15], [16]. Existing literature on distributed learning focuses mostly on environments where individuals communicate at every round. Of particular relevance to our discussion are a host of algorithms that follow the non-Bayesian learning scheme in Jadbabaie et.
Reinforcement Learning with Partially Known World Dynamics
Reinforcement learning would enjoy better success on real-world problems if domain knowledge could be imparted to the algorithm by the modelers. Most problems have both hidden state and unknown dynamics. Partially observable Markov decision processes (POMDPs) allow for the modeling of both. Unfortunately, they do not provide a natural framework in which to specify knowledge about the domain dynamics. The designer must either admit to knowing nothing about the dynamics or completely specify the dynamics (thereby turning it into a planning problem). We propose a new framework called a partially known Markov decision process (PKMDP) which allows the designer to specify known dynamics while still leaving portions of the environment s dynamics unknown.The model represents NOT ONLY the environment dynamics but also the agents knowledge of the dynamics. We present a reinforcement learning algorithm for this model based on importance sampling. The algorithm incorporates planning based on the known dynamics and learning about the unknown dynamics. Our results clearly demonstrate the ability to add domain knowledge and the resulting benefits for learning.
Probabilistic Anomaly Detection in Dynamic Systems
This paper describes probabilistic methods for novelty detection when using pattern recognition methods for fault monitoring of dynamic systems. The problem of novelty detection is particularly acute when prior knowledge and training data only allow one to construct an incomplete classification model. Allowance must be made in model design so that the classifier will be robust to data generated by classes not included in the training phase. For diagnosis applications one practical approach is to construct both an input density model and a discriminative class model. Using Bayes' rule and prior estimates of the relative likelihood of data of known and unknown origin the resulting classification equations are straightforward.